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Abstract 



We propose a scalable, e fficient and stati s tically motivated computational frame- 
work for Graphical Lasso ( Friedman et al. . 2007bl ) — a covariance regularization 



framework that has received significant attention in the statistics community over 
the past few years. Existing algorithms have trouble in scaling to dimensions larger 
than a thousand. Our proposal significantly enhances the state-of-the-art for such 
moderate sized problems and gracefully scales to larger problems where other algo- 
rithms become practically infeasible. This requires a few key new ideas. We operate 
on the primal problem and use a subtle variation of block-coordinate-methods which 
drastically reduces the computational complexity by orders of magnitude. We provide 
rigorous theoretical guarantees on the convergence and complexity of our algorithm 
and demonstrate the effectiveness of our proposal via experiments. We believe that 
our framework extends the applicability of Graphical Lasso to large-scale modern 
applications like bioinformatics, collaborative filtering and social networks, among 
others. 
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1 Introduction 

Problem Description Let S pxp denote a p-dimensional sample covariance matrix ob- 
tained through i.i.d samples from a multivariate Gaussian distribution with (unknown) 
covariance £ and precision matrix The negative log-likelihood is given by: 

/(0) := -logdet0 + (S,0) on y 0, (1) 
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where (S, 0) := tr(S0) and corresponds to the precision matrix. The MLE (when it 
exists) is = S _1 , but this estimator has high variance unless the sample size n is large 
relative to the dimension p. This makes the MLE a not-so-useful estimator of the covari- 
ance/precision matrix. In such high dimensional problems regularization (smoothing) is 
imperat ive to obtain reliable es t imates. In fact, for the Gaussian distribution the precision 
matrix (ICox &: Wermuthl . Il996t iLauritzenl . Il996l ) captures conditional dependencies among 
variables where absence of an edge (zero entry in the precision matrix) implies conditional 
independence. Hence, taking recourse to smoothing methods that induce sparsity is at- 
tractive. In addition to producing shrinkage estimators, a sparse precision graph leads 
to interpretable models and also provides model compression. In the context of learning 
large-scale graphs it is often undesirable from the point of view of computational/storage 
considerations to learn a model with all possible p 2 edges present. Surprisingly enough, 
for large scale problems i.e. with p w 10 4 — 10 6 , sparse precision graphs are compu tation- 
ally tractable, whereas their dense counterparts are not (IMazumder fc Hastid . I201U see for 
details,). 



The li regularization (Friedman et al.l . l2007bl ; iBanerjee et al.l . 120081 ; lYuan fc Lid . 12007 



Meinshausen fc Biihlmann , 20061 ) is often used in this context since it performs smoothing, 
induces sparsity, adds interpretation and forms an effective model selection procedure. 
This is popularly known as sparse inverse covariance selection or the Graphical Lasso and 
is obtained as a solution to the following regularized criterion: 



minimize q(&) := f (0) + A 



E 



0; 



(2) 



where A > is the amount of regularization imposed on the entries of the precision matrix 
= ((%))• Equation (j2J) is a convex optimization problem (Semi-Definite Program aka 
SDP) in the variable 0. The class of models described through equation ([2]) has already 
gained widespread interest in many statistical applications like biost atistics, functional mag- 
netic resonance imaging, network analysi s, collaborative filtering ([Friedman et all [2007b; 



Huang et al.l . I201CH : lAgarwal et al.l . 1201 ll ). and many more. Considerable progress has 



also been made in studying the sta t istica l properties of the estimator and its variants 
(IRavikumar et all 1201 ll ; lLam &: Fad . |2009_|) . We also note th a t the optimization in (|2]) is 
often used in a more non-parametric fashion (lAgarwal et al.l . l2011t IMazumder &: Hastid . 
201 ll ) for any positive semidefinite input matrix S, not necessarily a sample covariance 
matrix from a MVN sample. 



Context and Background Interior point methods for solving (J2J) scale poorly with in- 
creasing dimensions and quickly become infeasible for problem sizes around a hundred. For 
scalability purposes, first order methods relying on gra dient informatio n instead of Hessian 
(i.e. second order methods) become almost imperative (INesterovl . 120031 ). Over the past few 
year s there has been substantial interest in developing such specialized scalable solvers for 
OWjFriedman et all 120075 iBaneriee et all 120081 : ILuL 120091 . l20ld : IScheinberg et all 12010 : 



Yuad . 120091 : iBoyd et all l201ll ). However, existing solvers have difficulty in scaling to prob- 
lems with p > 1000 — precluding the wide-spread use of these methods in modern day 
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applications like collaborative filtering, graph mining, web-applications, large microarray 
data and other high dimensional problems. 

There have been o t her intere s ting f ormulations to sparse precision matrix estimation 



(IRothman et all I201CH ; ICai et all l201ll ; iMeinshausen fc Biihlmannl . 120061 ; IFriedman et al. 
20101. for example,). The formulation of IRothman et al.l ( 120101 ) is non-convex. ICai et al. 



(1201 ll ) consider a linear programming approach where the p recision matrix estima te need 
not be positive definite. Pseudo-likelihood based approaches ( IFriedman et all 120101 ) do not 
ensure positive defmiteness of the matrices. In this paper we focus on ((21). 



Motivation Several large scale covariance selection problems require algorithms that 
produce reasonably accurate approximations to the optimal so lution of ([2]) within a certain 



time limit or equivalently, a limit on the computational budget ( IBottou &: Bousquetll2008l ) 



In fact, for large scale problems, under computational constraints an approximate solu- 
tion to (|2J) is often the only feasible option. But for several applications, other than speed 
and scalability, it is necessary for the approximate solution to retain crucial properties of 
the optimal solution like sparsity and positive defmiteness. O ne such applica t ion o f large 
scale covariance selection was recently explored in the work of lAgarwal et al.l (120 111 ). The 
authors used a sparse inverse covariance regularization to estimate the covariance matrix 
of high dimensional random effects in a multi-level hierarchical model. The paper explored 
prediction problems in recommender systems where the goal was to predict responses on 
unobserved user-item cells in a large matrix using responses on observed user-item pairs. 
Each user u is assigned an M dimensional random vector (f) u that represents the user's 
latent affinity to M items. u 's are assumed to be drawn from a multivariate Guassian 
prior with unknown covariance. The covariance is estimated via a E-M framework using 
an £i regularization on the elements of precision matrix. 



The use of a sparse inverse covariance regularization in the paper (lAgarwal et al.l . 1201 ll ) 
led to a model with better predictive accuracy compared to other state-of-the-art meth- 
ods. Since the estimation is based on an E-M strategy, it requires strictly positive definite 
estimates of the covariance and precision matrix. Indeed, an optimization of the form (121) 
needs to be conducted in the M-step — hence it is enough to terminate the process early 
without complete optimization. Early stopping along with sparsity in the precision matrix 
leads to a drastic reduction in computation time. The key property of covariance estimation 
required for the method to work is the ability to return both the precision matrix and its 
exact inverse i.e. the covariance matrix. These properties, apparently are not possessed by 
existing algorithms for (|2|). This may have precluded the use of sparse inverse covariance for 
estimating covariances in high d imensional random-effects model. We note that the strat- 
egy used in lAgarwal et al.l (120111 ) using the model fitting method described in this paper is 
general and can be used to model covariance in other high-dime nsional multivariate random 
effect s model that arise in applications like spatia l statistics (IBernardinelli fc Montomolil . 
19921 ). social networks dflofll 120091 ; iHoff PDlEooi ). and many more. 

In the scenarios described above, we want a 'flexible' fitting algorithm for (J2J) such that: 



It can deliver a solution of arbitrary accuracy to (J2j) 
demands of the user/ application. 



the accuracy depending upon 
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• Even if a low accuracy solution is desired, the algorithm upon exiting should return 
a sparse and positive definite and its inverse _1 — fundamental ingredients for 
relevant statistical model fitting procedures. 

• The computational complexity per iteration of the algorithm is cheap enough to solve 
large scale problems. 

• It readily adapts to warm-starts for computing a path of solutions on a grid of A 
values. 

We believe estimation procedures for inverse covariance described in this paper will make 
it routine to apply large scale covariance selection to high-dimensional multivariate data. 



Our Approach We provide a brief outline of our approach and the salient features that 
make it differe nt from other existing al gorithms. Many of the sophist i cated state-of-the- 



art a lgorithms (IBaneriee et all . 120081 : iLuL 120091 l2010t IScheinberg et all . l2010t iBovd et al. 



20111 ) designed to solve (j2J) perform expensive operations like matrix inversions / eigen- 
decompositions on the entire matrix at every iteration requiring 0(p 3 ), which is clearly 
prohibitive for large problems. We take a different route by pursuing row/column block- 
coordinate based methods that cyclically update t he estimates of o n e row / c olumn at a time 
fixing the others at their latest values. Although Friedman et al. ( 2007bl ); Banerjee et al 



(120081 ) also pursue block-coordinate methods, our approach differs in a few very important 
ways. 

First, while we operate on the primal (where th e primal variable is the precision matrix 
©), [Friedman et al.l ( 12007bl ); iBanerjee et al.l ( 120081 ) operate on the dual of ([2]). The primal 
and dual problems have some subtle and important differences that ne ed consideration 



for large scale st a tistical applicat i ons. Algorithms operating on the dual ([Friedman et al. 



2007bl ; iLul . l2010t IBanerjee et al.l . 120081 for example) do not return a sparse and positive 



definite precision matrix unless optimization is done to a very high degree of accuracy - 
this may be prohibitive for large scale problems. A more detailed discussion of this issue is 
provided in Section |6j 

Second, we track both the precision and the covariance matrix over iterations, and our 
row/column block-coordinate wise procedure does not perform a complete minimization 
over the partial problems. This is a crucial observation since it reduces the row/column 
update cost from 0(p 3 ) to 0{p 2 ). Although the idea looks simple at first blush, such 
incomplete minimization over partial problems is not necessarily guaranteed to ensure a 
proper optimization algorithm with convergence certificates. In Section[3]we show that such 
a relaxation still guarantees convergence of our algorithm. To the best of our knowledge, 
such a convergence analysis is novel both in the statistics and optimization literature. 



Contributions We provide a summary of our main contributions before a detailed de- 
scription of our approach. We propose a novel model fitting algorithm for a popular co- 
variance selection method ([2]) that outperforms previous state of the art fitting algorithms 
for large problems with dimension p « 1 — 3 thousands. The Algorithm design requires 
new and novel ideas. Our proposal is particularly suited to compute a path of solutions 
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to (J2J) by using warm-starts on a grid of A values. The performance gains are quite im- 
pressive when compared to other existing algorithms for the same task as illustrated in 
Section [7J In addition, our algorithm is amenable to early stopping, provides a sparse and 
positive-definite solution, and scales to very large scale problems that are impractical to 
fit using existing methods. We provide a novel proof of asymptotic (algorithmic) conver- 
gence analysis, analyze complexity of the method and show the superiority of our methods 
through large scale simulation and data analysis. Finally, we outline how our approach can 
accommodate other row/column separable convex regularizers. 



2 Algorithmic Framework 

We now provide a detailed development of our fitting algorithm in this section, including 
convergence proof and computational complexity analysis. We begin with notations. 



Notations We denote the set of all k x k positive definite (respectively, positive semi- 
definite) matrices by S£ + (respectively S£). We will write A kxk y if A e S£ + , similarly 
A y implies A e S£- For a matrix A pxp we will denote its entries by , i — 1, . . . , p; j — 
1, ... ,p. 

For a vector u, the notation ||u|| 2 denotes the usual £2 norm, ||u||i denotes the i\ norm. 
For a matrix U, we will use 1 1 X_J 1 1 2 to denote its spectral norm i.e. the largest singular value 
of U. 

Description of the Algorithm The block coordinate method operates by fixing a row/- 
column index % e {1,2, ... ,p}, which without loss of generality, is assumed to be p. Parti- 
tion the precision matrix and the sample covariance matrix S as follows: 

@ =( @ 9 n M> s= (s n s lp V ^ 

\ "pi "pp J \ °pl *pp / 

Using standard formulae for determinants of block-partitioned matrices we have: 

logdet(e) = logdet(0 n ) + log(0 w - O'^Q^O^). (4) 

Using the above, the part of g{&) in equation ([2]) that depends upon the p th row/column 
of © is given by: 

g P (0 pp , 6 lp ) = - \og{6 pp - e' lp (& u )- 1 lp ) + 2s' lp 6 lp + (s pp + \)9 PP + 2A||6> lp || 1 . (5) 

Note that the positive-definiteness of assures 6 PP > 0. In (jSJ), the opti mizatio n vari 
ables are 9 PP and 0-\ p . Conventional forms of block coordinate descent (jTsend . 12001 



Friedman et al.l . l2007al ) when applied to this problem will require completely minimizing 
the function (jSJ) over the variables 0\ p and 9 PP . Clearly for large problems an accurate 
optimization of this problem is quite computationally intensive, especially since this needs 
to be done several times across all rows/columns. We choose to deviate from this approach 
and propose to perform an inexact minimization in the afore-mentioned stage. The fact 
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that such a deviation still ensures a proper optimization procedure will be discussed later, 
for now we continue with the description of the algorithm. 

Minimizing the criterion fl5]) with respect to 9 PP with other coordinates fixed gives: 

9 pp :=argmin g(9 pp ,0 lp ) = 1/ (s pp + A) + flUQn) -1 ^. (6) 

a 

P P 

The partially minimized objective (j5J), w.r.t. 9 PP is given by: 

min g p (e pp ,6 lp ) = logdet(s pp + A) + 2s' lp lp + 1 + (s pp + X)e' lp (& u )- 1 lp + 2A||0i p ||i. 

Ignoring the constants independent of lp above, we obtain an l\ regularized quadratic^ 
function, which we denote by: 

g p (0 lp ) = 6' lp {(s pp + \)& u l }6 lp + 2s' lp lp + 2X\\d lp \\ 1 . (7) 

We propose to use one sweep of cyclical coordinate-descent on this function g(#i p ), w.r.t. 
the variable 0\ p . 

We now summarize the update rule described above. Fix an arbitrary 0^0, 

§=(!" }') (8) 

and consider an increment in around in the direction of the p th row/column. This 
updates to 0: 

0^0 + (we; + e p u/) (9) 

where e p is a vector in 3ft p , with all entries equal to but the p th entry equals to 1, 
uj = (oux, . . . ,uj p ) denotes the "increment" in the direction of the p th row/column. Using 

Algorithm 1 Inner Block Inexact Coordinate-Descent 

1. Initial value of is 0. Assign = 0. 

2. Update the entries Ui, . . . , oj p -\ and also 0i p , as in ffTOj) and ffTTj) . 

3. Update cu p using the update-rule f|T3|) . Consequently change the (p,p) th entry of 
to uj p + 9 pp . 



notation <Q and g p (-), ■ G 3fJ p 1 as in ([7]), the update rule in oj is given by: 

Wi = argmin g p {. . . , (Oi p )i + Wi, (6ip)i+x, ■ ■ •) (10) 

(0i p )i <- (0\p)i + ^i, «- (^pi)i + tDj, z = 1, . . . ,p - 1. (11) 



1 note that the problem is convex only if 0^ y 0, which is the case by virtue of the positive dcfinitcncss 
of the precision matrices, as shown in Section \2A\ 



Observe that the update (ITU]) is simply a soft-thresholding operation: 

u)i =sgn(a«)(|oi| - \)+/bi - {0 lp ) h where, (12) 

ai =(si P )i + (s pp + A) I ^2(&u)ij(6ip)j + ^2(®Iih(0i P )j J , bi= {s pp + \){®ii)u 

\j<i j>i / 

Finally, upon updating the off-diagonal entries in 0, the diagonal entry is updated using 

uj p <- i/(s pp + A) + o' lp {® n )- x e lp - e pp (13) 

Overall, the above steps lead to the update formula: © < — © + (uie p + e p u)'). 

Note that the above operations require evaluations of the residual eij. This requires 
computing at the onset the full gradient vector of the smooth part in ([7]) at Q\ p . When a 
coordinate of the vector 0\ p gets updated, the entire gradient vector changes — this update 
can be achieved in 0(p) flops. Note that in case Qi = 0, no update is required. Hence, 
if on average the number of non-zeros in p i, Q p \ is k, then the update (ITU])-(1TT]) requires 
an overall 0(pk) flops which, for k <C p leads to a significant reduction over the cost of a 
dense matrix/ vector multiplication i.e. 0(p 2 ). Algorithm UJ summarizes the updating steps 
described above. 

The above description was for updating the p th row/column of the matrix 0. This needs 
to be done for every row/column — one full sweep across the p rows/columns defines one 
iteration of our algorithm. We now describe the full version of our algorithm in Algorithmic 
we name it: Primal INexact Minimization for Graphical Lasso (PINE-GL ). 

Algorithm 2 Primal Inexact Minimization for Graphical Lasso (PINE-GL ) 
Inputs: S, A. Initialization: (©,© _1 ). 

1 For every row/column i G {1, 2, • • • ,p, 1, 2, • • • }, perform steps 2-3 till convergence. 

2 Permute the matrix such that the i th row/column is the p th i.e. of the form (EJ). 
Obtain the matrix (@n) _1 via rank-one updates (see Section 12.1. 2p . 

Update the matrix using Steps [lj- [3] in Algorithm UJ : © ^— © + (u)e' p + e p Q') 

Obtain (@) _1 via rank-one-updates (see Section I2.1.2p . 

Re-permute the matrix to get back the original rows/column indexing. 

3 Assign (0, (0)- 1 ) <- (0, (0)" 1 ) 

4 Upon convergence, the estimates at A: (0a, (0a) _1 ) := (0, (0) _1 ) 



Convergence criterion The convergence criterion is based upon the relative difference 
in objective values between two successive iterations (i.e. sweeps across all the p rows/- 
columns), being less than a threshold. As described later, the objective value is computed 
on the 'fly', so expensive log-det computations need not be done separately. 
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Initialization of precision and covariance matrices PINE-GL requires as input, 
initialization for the tuple (0, _1 ). In case no prior choice for the input initialization is 
available, we use 1 ^— diag(sn + p, ■ ■ ■ ,s pp + p). Note that the diagonals of _1 at the 
KKT optimality conditions for (jSJ) is precisely the vector (s n + p, ■ ■ ■ ,s pp + p). 

Often PINE-GL is used for computing a path of solutions to (J2J) via warm-starts — in 
such a case the tuple (0, _1 ) is available as a by-product of the algorithm (see Section 

2.1 Important Properties of PINE-GL 

We outline some of the important properties of our Algorithm — which is instrumental 
in making it flexible. For ease of exposition the technical details are relegated to the 
Supplementary Materials Section [Gil 

2.1.1 Positive definiteness of precision & covariance matrices across the iter- 
ates 

If the starting matrix 0^0, then every row/column update in Step-2 of Algorithm [2] 
preserves positive definiteness of the updated matrix. For a rigorous proof see Section IC.l .11 
(supplementary materials). 

2.1.2 Tracking precision and covariance matrices at every iteration 

The function (J7J) that arises while updating the p th row/column requires knowledge of 
(0ii) _1 . Of course, it is not desirable to compute the inverse from scratch for every 
row/column i, with a complexity of 0(p 3 ). However, if both the current precision and 
covariance matrices i.e. (0, (0) _1 ) are stored at every iteration then it is quite simple 
to obtain (0 n ) _1 via a matrix rank-one update as described in f )32|) . This costs 0{p 2 ) 
and moreover is amenable to parallelism. Similarly after every row/column update in 
its inverse can be obtained via a rank-one update as described in fl33|) . Details of this 
implementation involve careful attention to details that are presented in the Section IC.1.21 
(supplementary materials). 

Tracking both 0, 1 along the iteration provides flexibility to our algorithm in terms 

of: 

• We avoid the additional cost of matrix inversion — 0(p 3 ). 

• Termination at a given computational budget which is crucial for large scale problems 
and often desirable for exploratory analysis. Since the operating variable is — the 
precision matrix estimate is sparse 0. 

• They provide the perfect recipe for warm-starts, when one is interested in computing 
a path of solutions to (J2]) (see Section 12 . 2 j) . 

2 This is different from the dual optimization problem, where the estimated positive definite precision 
matrices need not be sparse 
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It gives a simple but efficient way to evaluate the log-determinant of the precision 
matrices along iterations, since computing the log-likelihood in large problems is a 
fairly expensive task. 



2.2 Path Seeking Strategy 



In many real life applications it is desirable to compute a path of solutions {@a}a over a 
grid of A-values Xk > A^-i > • • • > Ai. One method is to compute the solutions across 
different tuning parameter values independently of each other, say on different machines. 
Otherwise, they can be co mputed serially wherein warm-starts/continuation strategies turn 



out to be very effective ([Friedman et al. 

fe Al: ,(e- w 



2007al). 



In such a case 
is taken as an input|2|for the Algorithm [2] at A = A 



>\ k , yvy\ k ) ) is iaKt;ii as an mpuiij iui uie /\igunmni[zjai a — a& 
See Section [7] for experimental studies showing impressive improvements 



the estimate at A^ i.e. 
i, for every k = K, . . . , 2. 



3 Convergence analysis 



In this section we will analyze the convergence properties of Algorithm [2j We summarize 
below the novelty and importance of addressing convergence analysis in this paper: 

Firstly, our p roposal is not a c onvent ional form of block coordinate descent as described 



in 



Tseng! (120011 ); iFriedman et al.l (J2007bJ), where the partial optimization problem (with the 



other variables fixed) is completely optimized. A complete-block coordinate minimization 
when applied to our problem requires a full minimization in Step 2 of Algorithm [2j over the 
2 th row/column. We differ by replacing this conventional full optimization strategy by a 
partial optimization — namely one pass of coordinate descent as described in Algorithm [TJ 
Secondly, our objective function of interest in non-smooth and due to the symmetry of 
the problem, the blocks i.e. the rows and columns have shared elements. Since 9 12 = 9 21 the 
value gets updated twice — once at row/column=l and the o ther at row/ c olumn = 2. Con- 
ventional forms of block coordinate minimization theorems ( ITsengi . l200ll ) for non-smooth 



functi ons demand separability (in blocks) — so they do not apply directly. IWEN et al. 
(120091 ) highlight thi s issue of overlapping entries and provide a proof of convergence. The 
work of IWEN et al.l ( 120091 ) considers smooth objectives — hence the results do not directly 
apply to our problem. 

Note that by construction the sequence of precision matrices produced by Algorithm [2] 
results in a monotone decreasing sequence of objective values. Even if the objective values 
converge (which is true if they are bounded from below), it is not necessary that the point 
of convergence corresponds to the mini mum of t he pro blem (jSJ) — the sequence of precision 



matrices need not converge either (see ITsengi (120011 ) for discussions). We address these 



issues and show that that the precision matrix estimates converge to the minimum under 
the mild assumption A > 0. Convergence holds for A = under the extra assumption that 
S y 0. 

The convergence analysis we present here is to the best of our knowledge novel. 



3 Note that the primal formulation is unconstrained 
problems. 



so warm-starts do not run into infeasibility 
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We start with an important Lemma appearing in|Lu| (120101 ) [Proposition 3.1]: 



Lemma 1. For every A > 0, problem (dp has a unique minimizer — 0^, which is (strictly) 
positive-definite and satisfies: 

filpxp > ©1 > al pxp 
for scalars a, (3, depending upon S, X,p with oo > (3 > a > 0. 

We are now ready to state the main theorem establishing the convergence of Algorithm [2] 

Theorem 1 (Asymptotic Convergence of PINE-GL ). Take A > 0. Let be the estimate 
of the precision matrix obtained at iteration k i.e. on completion of Step 3 of Algorithm^ 
Then the following hold true: 

(a) The sequence of objective values is monotone decreasing : 

g(&k + i)<g(& k ),Vk>l. (14) 

The sequence {g(@k)}k converges to the optimal solution of problem (0|). 

(b) The iterates ®k >~ 0,VA; and the sequence converges to 0^ — the unique solution to 

Proof. The proof, which is rather detailed and technical is provided in the Appendix, Sec- 
tion [O □ 

4 Some variants of PINE-GL 

This section discusses some variations to our proposal PINE-GL — leading to two important 
variants. 

A variant of Algorithm [1] is having a counter for the number iterations for StepEJ say T a . 
Our proposal of inexact minimization and for that matter the overall complexity analysis 
demands T Q = 0(1) i.e. T a p. In our numerical experiments we found T Q < 2 to be quite 
practical. 

If T Q is taken to be arbitrarily large, we get the conventio nal form of cyclical coo rdinate 



descent used for £ 1 regularized quadratic programs (QP) (IFriedman et al.l . 12007a] ). The 
magnitude of T Q depends upon the accuracy of the solution for the l\ regularized QP. In 
general, for a high accuracy solution, this can be arbitrarily large. If T Q = 0(p) this leads to 
a 0(p 3 ) complexity of Algorithm [TJ See Section [5] for details. We call this variant Primal 
Exact Minimization for Graphical Lasso i.e. PEX-GL — this is the more conventional 
form of block coordinate descent applied on the problem ([2]). 

We now proceed to discuss another simple but important variant of our algorithm PINE- 
GL , namely a 'growing' strategy — which we call Primal GRowth for Graphical Lasso 
i.e. PGR-GL . 
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4.1 Primal Growth for Graphical Lasso (PGR-GL ) 



Given an initial working dimension po (typically po = 1) and estimates of the precision 
and the covariance matrix (© poXpo , (© P0 xpo) _1 )' Algorithm [3] (Supplementary Materials, 
Section IC.2p describes the task of obtaining the solution to (|2J) (with S having dimension 
p x p). The main idea is to perform an initial forward pass by incrementally appending 
rows/columns and operating Step 2 of Algorithm [2] on the just added row/ column. Once 
the growing matrix is saturated to have p rows/columns — we make further passes through 
the p rows/columns, via Step-2 of Algorithm [2], till convergence. Since the 'growing' stage 
of the algorithm performs mainly cheap computations, it helps in providing pretty accurate 
warm-starts/ initializations @, (0) _1 to PINE-GL , within a very short amount of time. 
See also results in Section [7J When the task is to solve ([2]) for a single value of A, the 
method PGR-GL often turns out to be quite competitive with PINE-GL . 

Remark 1. The convergence of PGR-GL is straightforward. Firstly, it is not hard to 
see that the iterates maintain positive definiteness of the precision and its inverse and fur- 
thermore, since PINE-GL comes into action after one full-sweep of incrementally growing 
rows/columns, the convergence analysis for PINE-GL carries over. 

5 Computational Complexity 

Here we describe the computational complexities of our proposed algorithms PINE-GL , 
PEX-GL and PGR-GL . We provide a summarized report here, the details are available in 
Appendix, Section IBl 

Cost of PINE-GL : Every row/column update requires 0(p 2 ), and for a full sweep across 
p rows/columns — this is 0(p 3 ). For k full sweeps across p rows/columns this is 0(np 3 ), 
typically convergence occurs within k « 2 - 10. See Section TB.11 

Cost of PEX-GL : For every row/column the cost at the worst is 0(p 3 ). For a total of 
k' (~ 4-10) sweeps across all rows/columns the cost is 0(k'p 4 ). The cost may reduce to 
0{p 3 ) in case A is quite large. See Section IB~2l 

Cost of PGR-GL : The cost here is 0(p 3 ) as in PINE-GL - - but the constants are 
generally better than that of PINE-GL . See Section fB. 31 



6 Related work 



In this section we briefly describe some of the state-of-the art algorithms for criterion (T5]), 
their computational complexities and their fundamental diffe r ences with our proposal (s). 

The block coordinate proposals of Banerjee et al. ( 2008 ); Friedman et al.l ( 2007bl ) are 
related to our proposal — they solve the dual of the problem (J2J), which is given by: 



max — logdet(S + V) — p. 

|[V||oo<A 



(15) 



By strong duality the optimal solution of problem ffTBI) and ([2]) are the same, the primal- 
dual relationship being (@) _1 = S + V. (|15p operates on the covariance matrix whereas 
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the primal problem (j2j) operates on the precision matrix. As pointed out earlier, there is 
significant difference in pursuing the primal approach versus the dual. Ofte n in real-life ap 



plica tions (as is the case in a principal motivating application for this paper (lAgarwal et al. 



201 lh ) one desires an approximate solution since it gives a fairly good statistical estimate 



for the main statistical estimation problem. An approximate solution in the dual space 
need not translate to one of similar accuracy in the primal space according to criterion 
(J2J). Further the dual approach does not produce sparse precision matrices — if V solves 
(I15p . then the precision matrix © = (S + V) -1 is not sparse unless ( !T5|) is solved till high 
tolerance (10~ 8 -10 -10 ) . Ad-hoc thresholding strategies / post-processing strategies can 
be used to sparsify © — but positive defin iteness is not g u arant eed. 

The block coordinate maximization of iBanerjee et al.l (120081 ) on (fT5"j) . requires solving 
a box-constr ained QP comp l etely - - with cost 0{p 3 ). The GLASSO (Graphical Lasso) 
Algorithm of iFriedman et al.l (j2007bl ) minimizes the dual of the same box-constrained QP 
- an t\ regularized QP via cyclical coordinate descent. In the worst case this can be 
0(p 3 ), in case the solutions are very sparse this is 0(p 2 ). Inexact minimization strategies 
for GLASSO do not guarantee convergence. GLASSO need not produce a sparse and positive 
definite precision matrix unless it converges to a high a ccuracy. 

To s ummarize, both the block-coordinate proposals of lBanerjee et al. 1 (120081 ): IFriedman et al.l 
(l2007bl ) have a worst case cost 0(p 4 ) — the latter c an im prove to Q( p 3 ) if A is very la rge. 

The gradient based algorithm of IBanerjee et al.l ( 120081 ) inspired by iNesterovl ( 120051 ) has 
a per- iteration complexity 0{p 3 ) and overall complexity 0(- — ) (for an e > accurate 
solution). 

Another very efficient gradient-based algorithm is SMACS proposed in[Lu| ( 2010 ). which 
also solves the dual formulation. This has per iteration complexity 0(p 3 ) (due to expensive 
matrix operations like eigen-decompositions, matrix inversions) and an overall complexity 
ofO(£). 

The number of iterations taken by GLASSO (IFriedman et all l2007bl ; IBanerjee et all 
200Sf ) and PINE-GL (and its variants like PEX-GL , PGR-GL ) i.e. full sweeps across all 



rows and columns are relatively small in most examples — of the order of 4-10. For a 
solution of similar accuracy, the number of gradient iterations for SMACS is often of the 
order of hundreds (or even more than a thousand) for problems of size 1000/1500. 

It appears that most existing algorithms for solving the sparse covariance selection 
problem have a complexity of 0(p 4 ) or possibly larger, depending upon the algorithm used 
and the desired accuracy of the solution — making computations for fl2]) almost impractical 
for values of p much larger than 1000/1500. 

In contrast, every row/column update of PINE-GL is 0(p 2 ) — overall for k sweeps 
across all rows/columns this is 0(kp 3 ), where k denotes the total number of sweeps across 
all the rows/columns (See Section [5]). This is clearly an order of magnitude improvement 
over existing algorithms and is further substantiated by our experiments. 
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7 Experimental Studies : synthetic examples 



This section provides a comparison of our proposed fitting methods with some state-of-the- 
art algorithms for the optimization problem (j2J). 

We use our main proposal PINE-GL , its close cousin PGR-GL , and the variant PEX- 
GL for comparisons. 



Among t he existing algorithms, |LuJ (120101 ) was observed to be better than the p r oposal 
of Lu (2009 ), so w e used the former for our comparisons. IScheinberg et al. (l2010h : I Yuan I 
(120091 ); iBoyd et al.l ( 120111 ) describe algorithms based on the Alter nating Direction Methods 
of Multipliers — among them the algorithm of IBoyd et al.l (120111 ) was publicly available at 
Stephen Boyd's website. We experimented with this algorithm, but found it to be slower 
than GLASSO , so we did not include it for our comparisons. 

We thus compared our proposals with two very efficient algorithms : 

GLASSO : The algorithm of iFriedman et al.l (l2007bl ). We used the MATLAB wrapper 



around their Fortran code — available at http : / /www-stat . Stanford . edu/~tibs/glasso/ index . ] 

SMACS : denotes the algorithm of lEul (bold ). We used the MATLAB implementation 

smooth_covsel available at http : / /people . math . sf u . ca/~zhaosong/Codes/SM00TH_C0VSEL/ , 

Note that the default convergence criteria for the above two algorithms are different - 
GLASSO checks the successive changes in the diagonals of the precision matrix, whereas 
SMACS relies on duality gap. Moreover GLASSO and SMACS operate on the dual, whereas 
our proposals PINE-GL , PGR-GL , PEX-GL operate on the primal. Since, solving (j2D 
is the main goal, to make comparisons fair, we compared the primal likelihoods of the 
estimates produced by the algorithms. 

A relatively weak convergence criterion on the dual is typically quite far off from a sparse 
and positive definite precision matrix. The GLASSO algorithm tracks a precision matrix 
© and covariance matrix W along the iterations but (0) _1 7^ W and the discrepancy 
can be quite large before the algorithm converges to a high accuracy in the dual space. 
Furthermore, even if the estimated precision matrix (prior to convergence) is sparse it need 
not be positive definite. SMACS produces estimates of precision matrices along the 
iterations — though they are positive definite, they are dense. Arbitrary thresholding (to 
zero) of the smaller entries may destroy positive definiteness of the matrix. 

Our proposal on the other hand at every iteration tracks the precision matrix (which is 
both sparse and positive definite) and its (exact) inverse. 

In order to make the primal and dual problems comparable we consider the times taken 
by the algorithms to converge till a relatively high tolerance i.e. TOL=10 -5 , where 



Convergence Test Criterion: 



(g(Q k ) - g(Q,)) 



< TOL. 



(16) 



Here 0^ is the estimate of the precision matrix produced by the respective algorithm at the 
end of iteration k, and g(0*) is the estimate of the minimum of (EJ) obtained by taking the 
minimum over different algorithms after running them for a large number of iteration^. 



4 In our examples we ran PINE-GL , PEX-GL , PGR-GL (each) for 30 iterations. They were enough to 
give solutions till an accuracy of 10~ 8 . 
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All of our computations are done in MATLAB 7.11.0 on a 3.3 GhZ Intel Xeon processor, 
with single-computational-thread computations enabled. Our codes are written in MAT- 
LAB and C H GLASSO is written entirely in Fortran, smacs is written in MATLAB - 
we don't think this puts SMACS at a (timing) disadvantage, since the major computations 
are matrix operations which are very well optimized in MATLAB. 



7.1 Algorithm Timings 



The simulation examples we used were inspired by |Lu| (120101 ). The population precision 
matrix having approximately 0.01 proportion of non-zeros is generated as follows. 

We generate a matrix A pxp with entries in { — 1,0,1}, with proportion of non-zeros 0.01. 
The —1 and 1 occur with equal probability. A is symmetrized via A i— 0.5 ■ (A + A'). 
All the eigen-values of A are lifted up by adding a scalar multiple of a identity matrix 
: -f- A + Tl pxp such that the minimal eigen-value of S _1 is one. The (population) 
covariance matrix is taken to be E. We then generated ~ MVN(0,H),i = 1,...,N. 
The sample correlation matrix was taken as S. 

We considered a battery of examples with varying N, p: 
(a) N e {500, 1000, 2000} for p = 1000 and (b) N e {2000, 3000, 4000} for p = 1500 

Table [1] summarizes the timing results (in seconds) for the examples above for different 
algorithms. The timings are shown for different A values — algorithms are cold-started 
at their default starting points. We see that PGR-GL , PINE-GL are always the winners 
and often by multiplicative factors. PINE-GL turns out to be the clear winner overall. 
PEX-GL turns out to be slower than PINE-GL - - which supports our crucial idea of 
inexact minimization in the inner-blocks and also supports our complexity analysis. As 
expected, the timings for the block coordinate algorithms deteriorate for smaller values of 
A. For dense problems (which are arguably harder problems for the primal formulation), 
PGR-GL consistently performs quite well. PEX-GL and GLASSO often perform similarly. 
SMACS tends to be quite slow for larger problems, when compared to the block coordinate 
counterparts. We found SMACS to be quite competitive for very small values of A — but 
these (almost) unregularized solutions are not much statistically meaningful unless n > p. 
SMACS faced problems with convergence for n < p situations where S was low-rank. 

These results demonstrate the impressive comparative performances of PINE-GL and 
PGR-GL compared to current state-of-the-art methods — making it probably a method of 
choice in scenarios where it is possible to run the fitting algorithms till a high tolerance. As 
mentioned earlier, the primal formulation is particularly suited for this task of delivering 
solutions with lower tolerances. It operates on the primal <^ delivers a sparse and positive 
definite precision matrix and its exact inverse. Table E] (Supplementary Material, Section 
IG.3P shows average timings in seconds across a grid of ten A values with varying degrees 
of accuracy. PINE-GL , PEX-GL and PGR-GL return lower accuracy solutions to (J2J) - 
in times which are much less than that taken to obtain higher accuracy solutions. The 
gains are rather substantial given the limited scope of the dual optimization problems in 
the 'early stopping' paradigm. 



5 The C code was generated via the embedded-matlab function emlc, an automated C code generator 
in Real Time Workshop in MATLAB. 
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D / N 


% of 
nnz 


PGR-GL 


Algorithm Times ( 
PEX-GL PINE-GL 


sec) 
GLASSO 


SMACS 




92 


48.8 


140.4 


119.4 


143.8 


308.5 


1 x 10 3 / 2 x 10 3 


78 


143.3 


130.5 


94.7 


151.8 


288.6 




46 


149.7 


167.4 


108.6 


220.1 


217.7 




85 


79.1 


143.1 


66.9 


130.4 


398.7 


1 x 10 3 / 1 x 10 3 


71 


143.2 


150.6 


69.0 


160.5 


408.2 




49 


132.6 


225.3 


187.2 


295.7 


464.2 




91 


82.4 


93.8 


66.1 


94.4 


— 


1 x 10 3 / 0.5 x 10 3 


73 


81.5 


123.2 


119.5 


179.7 


— 




48 


354.8 


382.4 


340.2 


544.5 


— 




86 


223.2 


258.7 


186.3 


571.1 


2310.4 


1.5 x 10 3 / 4 x 10 3 


72 


221.8 


353.4 


186.5 


577.8 


1534.5 




47 


401.8 


656.1 


488.4 


851.8 


1062.8 




86 


212.6 


228.9 


177.4 


533.3 


1736.3 


1.5 x 10 3 / 3 x 10 3 


72 


221.3 


256.4 


186.0 


573.7 


2017.2 




48 


525.6 


675.8 


494.2 


880.4 


1521.4 




85 


283.3 


364.8 


222.7 


566.5 


1759.3 


1.5 x 10 3 / 2 x 10 3 


72 


222.6 


258 


186.3 


574.2 


2246.9 




40 


757.8 


1019.7 


706.6 


1186.6 


1780.3 



Table 1: Table showing the times in seconds till convergence to a tolerance of TOL= 10~ 5 
fll6p . for different algorithms for different problem set-ups. For every combination of (p, N) 
three different rX values are considered — as indicated by the % non-zeroes in the final 
solution for PEX-GL . All algorithms are warm-started at their default starting values. 
The ' — ' corresponding to SMACS indicates that the algorithm did not converge for this 
example with N < p. 

The next section compares different algorithms as 'path-algorithms'. Path-based-algorithms 
and algorithms operating on a single value of A are quite different performance-wise. An 
algorithm that tends to be very fast as a path algorithm need not be as good at a single 
value of A. This is because, a good warm-start improves the convergence-rate of the algo- 
rithm. Similarly, an algorithm that is very good at a single value of A, may not benefit 
much from warm-starts (a typical example being interior point methods). This is why we 
compare our proposals in both scenarios. 

7.2 Tracing out a path of solutions 

Continuing with Section 12.21 we see what happens to the rate of convergence of PINE-GL 
in presence of warm-starts. Note that SMACS and PGR-GL do not allow for warm-starts 
but GLASSO and PEX-GL do. The data is the same as used in the previous section. We 
took a grid of ten A values as follows: the off-diagonal entries of the sample covariance 
matrix S were sorted as per their absolute values and ten A values were chosen from the 
entire range (along equi-spaced percentiles of the absolute values in S) — the largest A 
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value being r] maiX q and the smallest was r/ min ■ q, where q := maXj>j \Sy\. All algorithms were 
run till a tolerance of 10~ 5 . Table [2] summarizes the results. 



p/N 



Imax/ ^min 


average % 


(xlO" 2 ) 


of nnz 


16/ 0.64 


61.3 


21/ 0.83 


62.8 


28/2 


66.7 


13.1 / 0.4 


62.4 


14.3 / 0.53 


62.8 


16.0 / 0.61 


63.1 



Algorithm Times (sec) 
PINE-GL GLASSO SMACS 



speed-up 
(PINE-GL 



1 x 10 3 / 2 x 10 3 
1 x 10 3 / 1 x 10 3 
1 x 10 3 / 0.5 x 10 3 
1.5 x 10 3 / 4 x 10 3 
1.5 x 10 3 / 3 x 10 3 
1.5 x 10 3 / 2 x 10 3 



77.27 
116.38 
105.52 
260.54 
280.19 
267.03 



144.10 
202.73 
315.17 
579.52 
613.67 
697.65 



250.43 
412.14 

145.35 
1631.6 
1892.1 



1.56 
1.52 
1.89 
1.28 
1.23 
1.53 



Table 2: Table showing the comparative timings (in seconds) of the three algorithms PINE- 
GL , GLASSO and SMACS across a grid of ten A values. Times are averaged across the ten 
A values. The averaged % of non-zeros in the final solution across the different A values 
along with the limits of the A values are also shown. The last column shows the speed-up 
factor for PINE-GL using warm-starts over the time spent to compute the solutions of the 
same accuracy without using warm-starts. 



As the column on speed-up factor shows, the path algorithm of PINE-GL is much 
faster than obtaining the solutions at the same values of A without warm-starts. PINE-GL 
continues to perform very well when compared to the path algorithm GLASSO . 



8 Real Application: Learning Precision graphs for Movie- 
Movie Similarities 

MovieLens Data Set: We study an application of the inverse covariance estimation 
method on a dataset obtained from a movie recommendation problem. We use the bench- 



mark MovieLens- 1M dataset available at http : //www . grouplens . org/node/ 12 , which con 



sists of 1M explicit movie ratings by 6,040 users to 3,706 movies. The explicit ratings 
are on a 5-point ordinal scale, higher values indicative of stronger user preference for 
the movie. The statistical problem that has received considerable attention in the lit- 
erat ure is that of predicting exp licit ratings for missing user-movie pairs. Past stud- 



ies ( ISalakhutdinov fc Mnihl . 120081 ) have shown that using movie-movie similarities based on 
"who-rated-what" information is strongly correlated with how users explicitly rate movies. 
Thus using such information as user covariates helps in improving predictions for explicit 
ratings. Other than using it as covariates, one can also derive a movie graph where edge 
weights represents movie similarities that are based on global "who-rated-what" matrix. 
Imposing sparsity on such a graph is attractive since it is intuitive that a movie is gener- 
ally related to only a few other movies. This can be achieved through PINE-GL . Such a 
graph provides a neighborhood structure that can also help directly in better predicting 
explicit ratings. For instance, in predicting explicit rating by user i on movie j, one can 
use a weighted average of ratings by the user in the neighborhood of j derived from the 
movie-movie graph. Such neighborhood information can also be used as a graph Laplacian 
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to obtain bette r regularization of user factors in matrix factorization model as shown in 



Lu et al.1 (120091 ). 

Other than providing useful information to predict explicit ratings, we note that using 
who-rated-what information also provides information to study the relationships among 
movies based on user ratings. We focus on such an exploratory analysis here but note that 
the output can also be used for prediction problems following strategies discussed above. 
A complete exploration of such strategies for prediction purposes in movie recommender 
applications is involved and beyond the scope of this paper. 

We define the sample movie- movie similarity matrix as follows: for a movie j, Xj is the 
binary indicator vector denoting users who rated movie j. The similarity between movie j 
and k is defined as Sjk = X]Xk . The movie- movie similarity matrix S thus obtained 

is symmetric and postive semi-definite. 



8.1 Timing comparisons 

As noted earlier, for this application we use criterion 02]) in a non-parametric fashion, 
where our intention is to learn the connectivity matrix corresponding to the sparse inverse 
covariance matrix. We will first show timing comparions of our method PINE-GL (the 
path-seeking version) along with glasso and SMACS . We see that PINE-GL is the 
only method that delivers solutions within a reasonable amount of time — the estimated 
precision matrices are used to learn the connectivity structure among the items. 

We ran GLASSO for nine A values - which were equi-spaced quantiles between the 8-th 
and 75-th percentile of the entries {|sjj|}j >: ,- — this range also covers estimated precision 
matrices that are quite dense. 

The path versions of PINE-GL , GLASSO and SMACS were used to obtain solutions on 
the chosen grid of A values. The timings are summarized below: 

• PINE-GL produced a path of solutions across the nine A values using warm-starts in 
a total of 6.722 hours. 

• The path version of GLASSO on the other hand, could not complete the same task of 
computing solutions to fl2]) on the same set of nine A values, within two full days. 

• We also tried to use SMACS for this problem, but it took more than 14 hours to 
compute the solution corresponding to a single value of A. 

The timing advantages should not come as a surprise given the computational complexity 
of PINE-GL is order(s) of magnitude better than its competitors. The performance gap 
becomes more prominent with increasing dimensions — traces of evidence were observed 
in Section [3 



8.2 Description of the Results 

Figurel2l(Section l0.4l of Supplementary Materials) displays the nature of the edge-structures 
and how they evolve across varying strengths of the shrinkage parameter A for the whole 
precision-graphs produced by PINE-GL . 
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For a fixed precision matrix, a natural sub-set of 'interesting' edges among the all- 
possible p(p — l)/2 edges are the ones corresponding to the top K absolute values of partial 
correlation coefficients. The nodes corresponding to these K edges and the edges of the 
precision graph restricted to them form a sub-graph of the p x p precision graph. We 
summed the absolute values of the off-diagonal entries of the precision matrices across the 
different A values. The (averaged) p(p — l)/2 values were ordered and the top ten entries 
were chosen. These represent the partial correlations having the maximal strength (on 
average) across the different A values taken. There were 20 vertices corresponding to these 
top ten partial correlation coefficients. Figure [I] shows the sub-graphs of the movie-movie 
precision graphs restricted to the selected 20 movies, across different A values. The movie to 
ID mappings are in Table HJ in Section 10.51 at Supplementary Materials Section. The edges 
in these subgraphs provide some interesting insights. For instance, consider the sub-graph 
corresponding to the largest A (highest sparsity) . Part of strong connectivity among movies 
0,1,2,3 is expected since 0,1 and 2,3 are sequels. It is interesting to see there is a connection 
between and 3, both of which are Sci-fi movies related to aliens. Other connections also 
reveal interesting patterns, these can be investigated using the IMDB movie database. 

Another very relate d application of the set-up described above is the one appearing 
in lAgarwal et al.l (120111 ). where one models the raw who-rated-whom binary data using a 



multivariate random effects model. 



9 Conclusion and Remarks 

We propose a flexible, scalable and efficient algorithmic framework for large scale £± pe- 
nalized inverse covariance selection problems that is used in several statistical applications. 
The framework gives rise to our main proposal PINE-GL , its close cousin PGR-GL and 
PEX-GL — all of them operate on the primal version of the problem (j2j). The key in- 
gredient to scalability and efficiency requires a novel idea — that of inexact-minimization 
over an exact-one in the row/column blocks. The non-smoothness in the objective, posi- 
tive definiteness of the precision matrices and the overlapping entries of the rows/columns 
necessitates a separate convergence analysis. We address this issue. This observation imme- 
diately brings down the per-iteration complexity of the algorithm by an order of magnitude, 
from 0(p 3 ) to 0(p 2 ). On problems of size p = 1 — 3 K, our proposal performs extremely 
well when compared to state of the art methods designed for problem (T2J). Our proposal 
tracks a sparse, positive definite precision matrix and its exact inverse i.e the covariance 
matrix at every iteration and is suited to return a solution with low/moderate accuracy 
depending upon the application task at hand. In particular, this makes it particularly suit- 
able for large scale covariance selection problems where a very high accuracy solution is not 
practically feasible. PINE-GL is particularly suitable for computing a path of solutions on 
a grid of A values using warm-starts — and it performs better when compared to existing 
path algorithms. 

Our algorithm is supported by complexity analysis which shows that it is favorable 
over existing algorithms by an order of magnitude. Our proposals are well supported by 
numerical experiments on real and synthetic data. 
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Figure 1: Pictures of subgraphs of the precision matrices induced by the 20 movies corre- 
sponding to the largest absolute partial correlations (averaged across difefrent A values). 
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Exact Thresholding of Covariance Matrices Fairly recently iMazumder &: Hastie 
(120111 ) proposed an exact thresholding strategy which becomes useful if the non-zeros of 
the solution to f$Z§ breaks down into connected components. The same components can 
be recovered by looking at the non-zeros of the matrix rj\(S), where ry (-) = sgn(-)(| • I — A) 



i s the component-wise soft-thresholding operator at A. As shown in IMazumder fc Hastie 
(120111 ). this strategy can be used as a wrapper around any algorithm for solving ($Z§ for 
sufficiently large A so that it admits a decompostion into connected components. Since the 
aforementioned strategy heavily relies on having a scalable algorithm for fl2]), determined 
by the size of the maximal component - - our proposal opens the possibility of solving 
problems (j2J) for an even wider range of A-values. 



Extensions to other convex regularizers Though we were concerned with ([2]) in this 
paper, our framework can accommodate other variants of block-separable regularizers, in 
place of the l\ norm on the entries of the matrix. This includes: 

1. The weighted l \ norm i .e. ; where Pij > 0, Vi,j are given scalars. See 



Friedman et al.l (l2007bl ) ; iFan et al.l (120091 ) for use of this penalty. 



2. The group lasso /node-sparse ( IFriedman et al.l . 120101 ) norm on the blocks of the pre- 



cision matrix: Ef=iVEi^r 



3. The elastic net regularization i.e. a l%l + (1 — ot) E« @ij ( jZou &: Hastid . 120051 ) 

These all are achieved by modifying Algorithm (TJ with an inexact minimization strategy 
for the above penalties. 
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A 



APPENDIX : Proofs and Technical Details 



This Section accumulates the technical details and proof details that were outlined in the 
text. 



A.l Theorem [T] : Set up and Proof 

Firstly we will like to point out some important points about the convergence result which 
also sheds important light on the properties of the solution fl2J. If A > 0, the sequence of 
objective values and the estimates are bounded below (see Lemma [1]). Then by standard 
results in rea l-analy s is, the sequence of objective values converge to g^ (say). It is not clear 
however (see iTsengl (120011 ). and references therein for counter-examples) that the point of 
convergence i.e. corresponds to the minimum of the problem ([2]). Fortunately however, 
we show in this section that actually is the optimum of the minimization problem (T2]). 

Observe that the convex optimization problem <^j, for A = and S rank-deficient will 
have its infimum at — oo. However, it turns out that for A > 0, this condition is avoided 
and the optimal value of the problem is finite. 



As is the case for many convex optimization problems ( iBoyd h Vandenberghd . I2004J . see 
for example), its is not necessary that problem (j2J) will have a unique minimizer. It turns 
out, however, that as soon as A > 0, problem (J2J) has a unique minimizer. The assertions 
made above are consequences of Lemma [H 

We need to set up a formal framework and present a few lemmas leading to the proof. 



The Ei Regularized Proximal Map A variant of Step [2J in Algorithm [T] is one where 
we use a proximal step( Nesterov . 20071 ). instead of one sweep of cycli cal block- c oordi nate 
descent. Recall that the function g p (0 lp ) ((Tj) is in the composite form (INesterovl . 120071 ) : 



ffp(»ip) = / P (0i„) + 2A||0i p ||i (17) 
where f p (-) denotes the smooth part given by: 

fp&p) = °ip{(sp P + A)e n 1 }0i P + 2s' lp lp 

It is easy to see that the gradient V/ p (-) of the function f p (-) is Lipschitz continuous i.e. : 

||V/ p (u)-V/ p (v)|| 2 <L p ||u-v|| 2 (18) 

and an estimate of L p = 2(s pp + A)!!©^ 1 ^. The proximal step or the generalized gradient 
step (in place of the coordinate-wise updates ([TO]) ) is given by the following: 



uj = argmin - {0 lp - -^-V f p {0 lp )\\ 2 2 

^gsftp-i L L p 

~ 1 ~ 2A ~ 

= v (o lp --vf p (e lp y,—)-e lp 

L p L p 



2A||w||i} - e 



where Vf p (d lp ) = 2(s pp + A)(0n) 1 lp + 2s lp and r/(-;7) = sg n (") 
thresholding operator applied component-wise to a vector • G 



(19) 
(20) 

7)4. is the soft- 
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In what follows, we will study a minor variation in Step 2 of Algorithm [TJ Instead of 
using one sweep of cyclical coordinate descent, we will use one proximal step as described 
in (120|) . The convergence result with the cyclical coordinate-descent version is no different 
but simply adds to the technicality of the analysis. 



Properties of the soft-thresholding operator Before going into the proof we need 
a lemma pertaining to an important property of the soft-thresholding operator i.e. the i\ 
Regularized Proximal Map. 

For a function h : §l q i— )■ 3? with Lipschitz continuous gradient: 

||V/i(x)-V/i(y)|| <L||x-y|| (21) 



the following majorization property holds (IBeck fc Teboulld . 120091 . See for example, Lemma 
2.1) 

-||w - x||a + (V/i(x), w - x) + h(x) > h(w) (22) 
The minimizer wrt w for the l\ regularized problem: 

Maj(w;x) := -||w - + (V/i(x), w - x) + h(x) + A||w||i (23) 
is given by the proximal map or the soft-thresholding operator: 

M(x) :=argmin { — ||w - x||| + (V/i(x), w - x) + A||w||i} (24) 
w 2 

= argmin { ^||w - (x - yV/t(x))||2 + A||w||i} 

= V f(x-ivA(x));^ 

The following Lemma states an important property of the map M(x). 

Lemma 2. Consider the function H(-) defined by: 

H(w) = h(w) + A||w||i (25) 

with h(-) having the property in l[21\) . For any x G 9ft 13 and M(-) as defined in [Efy the 
following holds: 

|.(^(x)-ff(M(x)))>||x-M(x)||» (26) 

Proof. It can be shown usi ng elementary convex a nalysis and the properties of the map 
Maj(-) (EHD defined above, feeck fc Teboullel . 120091 . See Lemma 2.3): 

(x) - H(M(y)) > ^ ||Af(y) -y\\ 2 2 + L(y - x, M(y) - y) (27) 
Substituting y = x above we get the desired result in (1261) . □ 
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Proof of Theorem Q], part (a) The monotonicity follows by construction of the 
sequence of iterates ® k . 

The iterate ® k is obtained by updating all the p rows/columns of the matrix 0, cycli- 
cally. We now introduce some notation. Let us denote the successive row/column updates 
by: 

Update row/column 1 — > ®k,i (28) 
Update row/column 2 — > ® k)2 



Update row/column p — >■ @ fciP 

Further we use ®k,i[—i,i] € 3ft p_1 to denote the i th column of the matrix &k t % (excluding 
the diagonal entry). We need to estimate the difference in ®k,i-i and ©^ — note that 
they differ only in the i th row/column. 

To settle ideas and using the same set-up as in Section [21 we concentrate on row/column 
p. The difference ® k , p [—p,p] — ® k , P -i[~ PiP] can be estimated by using Lemma [2j To see 
this recall the framework of Algorithm [2] as described in Section [2j To update the p th 
row/column we need to consider a proximal-gradient step in the function g p (-) as described 
in ( II 7p . This function exactly fits into the framework of Lemma [21 for specific choices of 
L, h(-), H(-) and A. Let the Lipshcitz constant at this iterate be denoted by L k)P . Using 

the equality g(® k , P -i) ~ g(®k, P ) = 9 P (®k, P -i[-p,p\) - 9 P (®k, P [-PiP\) (which follows by 
construction) and Lemma [2] we have: 

-^-(0(e fcjP _i) - g(Q k ,p)) > \\® k , P [-P,P] - ®k, P -x[-P,p)\\l (29) 

Recall that we established in Section I2.1.1[ that Agorithm [2] produces a sequence of esti- 
mates ®k,i such that 0fc j >- 0. Further note that the minimum of ([2]) is finite (as A > 0, 
Lemma [1]). It follows that there exists p' > p > such that 

p'Ip X p >- @ k ,i >- plpxp^k (30) 

where I pxp is ap dimensional identity matrix. Since is a scalar multiple ( !T8|) of the spec- 
tral norm 11(0^. ^—2, — i]) _1 ||2, it follows from ( 130]) that mi kti L k i > and oo > sup ki L k ^. 

Thus using the monotonicity of the sequence of objective values g{@ k) i) for i — 1, . . . ,p, 
k > 1, the fact that the minimum value of ([2]) is finite and the boundedness of j^—, 
we see that the left hand side of (|29|) converges to zero as k — > oo. This implies that 
® k ,p[—p,p] — 0jfc, p -i[— p,p] — > as k — > oo i.e. the successive difference of the off-diagonal 
entries converge to zero as k — > oo. In particular, we have this to be true for every row/ 
column i 6 {1, . . . ,p} i.e. 

0fc,i[-M] - 0fc,i_i[-(i - - 1] ->■ 0, k -» oo for every % e {1, ... ,p} 

In the above, we use the convention fe o [— 1, 1] = ® k _ 1)P [—p,p]. 

Since is a bounded sequence it has a limit point — let be a limit point. 

Moving along a sub-sequence (if necessary), n k C {1,2,...} we have @ nfe — > ©oo. 
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Using the stationary condition for the update described in (|2T)|) . the p th row/column 
(off-diagonal entries) satisfies: 

(s pp + A)(© 00 ) n 1 (© 00iP [-p,p]) + sip + A sgn(©oo iP = 0. 

The above holds true for every row/column % G {l,...,p}. Using the above stationary 
condition along with the update relation for the diagonal entries as in Step [3] in Algorithm 
[TJ it is easy to see that the limit point @oo satisfies the global stationary condition for 
problem ([2]) i.e. 

-(e oo )- 1 + S + Asgn(e oo ) =0 
Thus we have established that g(@k) converges to the global minimum of the function 

g(-). 

Proof of Theorem [H part (b) For this part it suffices to show that there is a unqiue 
limit point for the sequence Note that we showed in Part (a) that every limit point of 
the sequence 0& is a minimizer for the problem (T5]) . Now by Lemma dj there is a unique 
minimizer of (T5]). This implies that has a unique limit point and hence the sequence 
converges to ©oo, the minimizer of g(-). 

B Complexity analysis details of PINE-GL ,PEX-GL 
and PGR-GL 

B.l Complexity of PINE-GL 

Step 2 of Algorithm [2] requires computing ©n, this requires 0((p — l) 2 ) (see Section I2TT1) . 
Algorithm [1] does one sweep of cyclical coordinate descent — this has worst case complexity 
0((p— l) 2 ) — in case the solution to the l\ regularized QP is sparse, the cost is much smaller. 
It should be noted here that any constant number of cycles (say T Q ) of cyclical coordinate 
descent will lead to a complexity of 0{T ■ {p — l) 2 ). As long as T„ C p (say T Q = 1/2) 
this leads to 0(p 2 ). This is followed by updating the covariance matrix with 0((p — l) 2 ), 
via rank-one updates. Hence Step 2, for each row / column has a complexity of 0(p 2 ), for 
p rows/columns this is 0(p 3 ). If k denotes the total number of full sweeps across all the 
rows and columns this leads to 0(np 3 ). In practice based on our experiments we found 
k = 2 — 10 is sufficient for convergence till a fairly high tolerance. While computing a 
path of solutions with warm-starts k is around 2-4 for different values of A. The value of k 
increases when A is very small so that the resultant solution 0* is dense — but since these 
A values correspond to almost un-regularized likelihood solutions, in most applications they 
are not statistically interesting solutions. 

B.2 Complexity of PEX-GL 

In case of using PEX-GL the analysis is quite similar to above but there are subtle dif- 
ferences. The complexity of matrix rank- updates remain the same 0(p 2 ), what changes is 
the number of coordinate sweeps required for Algorithm [T] to solve the inner £i regularized 
block QP till high accuracy. This problem is fairly challenging in its own right and is 
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computationally hard when p is a few thousand. Precise convergence rates of coordinate 
descent to the best of our knowledge are not known. This depends largely upon the data 
type being used. Often the number of coordinate sweeps i.e. k ca n be O(p) — lea ding to a 
complexity of 0(p 3 ). If (generalized) gradient descent methods ( iNesterovl . 120071 ) are used 
instead of cyclical coordinate descent — then the number of iterations k is of the order of 
0(1/ e), where e > is the accuracy of the solution. For e ~ 1/p, k pa p. Thus PEX-GL 
has roughly a complexity of 0(p 3 ) for every row/column update — leading to an overall 
cost of 0(p 4 ) for one full sweep across all rows/columns. If there are k' full sweeps across 
all rows/columns the total cost is 0(n'p 4 ). 

In case A is large enough so that every l x regularized QP can be solved quite fast i.e. 
0(p 2 ) — the total cost of PEX-GL reduces to 0(p 3 ). 



B.3 Complexity of PGR-GL 

The main difference of PGR-GL lies in the manner in which it updates the rows/columns 
via appending rows/columns in Steps 1-3 of Algorithm^ Steps 1-3 have a cost of Y7 m =i m2 > 
which is approximately p 3 /3. When compared with one sweep of PINE-GL , the growing 
step is approximately one-third of the cost of PINE-GL . One sweep of the growing strategy 
leads to inferior performance when compared to one sweep of PINE-GL . However, after 
a smaller number of sweeps, PGR-GL can obtain better likelihoods than PINE-GL . In 
some examples, as seen in the experimental section, PGR-GL is faster than PINE-GL in 
obtaining a solution with low accuracy. 
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C SUPPLEMENTARY MATERIALS 

This portion gathers some of the technicalities avoided in the main text of the article and 
the Appendix |A] 

C.l Properties of the updates of Algorithm [2] 

We present here a detailed derivation of the properties of the updates of Algorithm [2j 

C.l.l Positive-definiteness 

The updates described in Steps (CQ), (EJ), ([3]) in Algorithm [T] actually preserve positive 
defmiteness of 0, under the assumption that © >- 0. Using the decomposition for in 
©, it follows from standard p roperties of positive defmiteness of block partitioned matrices 



flBovd k Vandenberghd . 12004 ) that: 



y iff 0n y 0, 6 PP - d' lp @ile lp > (31) 

Observe that ©n = ©n, by construction. Further by the property of the update Step [3] 
(Algorithm |2]) we see that 

Kp ~ 0'i P ®iiOip = ~ — — t > °- 

This shows by (J2Q that ©V 0. 

A simple consequence of the above observation is that logdet(0) is finite if logdet(0) 
is so. 

C.1.2 Tracking (0, (0) x ) 

For updating the p th row/column, PINE-GL requires knowledge of (0n) _1 . Of course, it is 
not desirable to compute the inverse from scratch for every row/column i, with a complexity 
of 0(p 3 ). Assume that, before operating on the p th row/column we already have with us 
the tuple (0, (0) _1 ) — then it is fairly easy to compute (0n) _1 via: 

(0n) _1 = Wn - wipWpi/wpp, (32) 

where W := (0)~ x and the blocks Wn, Wjp, w p i, w pp of the matrix W, have the same 
structure as in 03]). This follows by standard-formulae of inverses of block-partitioned 
matrices — and the update requires 0(p 2 ). 

Once the p th row/column of the matrix is updated, we obtain 0. The matrix W : = 
(0) _1 is obtained via: 

W n = - '^W; w lp = (gHgg^ , (33) 

(Opp — p i(0ii) l 0i P ) (Opp — p i(0ii) l 0i P ) 
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Algorithm 3 PINE-GL with Growing Strategy : PGR-GL 



Inputs: A, Sp Xp and p x p matrices (0, (0) x ), where p < p. 
Set initial working row/column m = po + 1. 

1 Consider amxm dimensional problem of the form (j2J) with covarianc^ Si :mx i :m and 
initializations (0, (0) _1 ), of dimension m x m where 

<- blkdiag(0, 1 —tt); (©T 1 ^blkdiag((0)- 1 ,s mm + A) (34) 

\Smm ~T~ A) 

2 Apply Step 2 of Algorithm [2] with inputs Si :mx i :m , (0, (0)" 1 ) an d input dimension 
m. 

The above results in (0, (0) _1 ) — both m dimensional matrices. 
Assign (0, (0)- 1 ) <- (0, (0)- 1 ) 

3 Increase m — m+ 1, and go to Step-1 (if m < p); else go to Step-4. 

4 Apply Algorithm [2] with Si :px i : p, and initializations (0 pxp , (0 pX p) _1 ), till a desired 
tolerance. The output is the solution to problem (j2J). 



where as before the blocks of the matrix W, follow the same notation as in ()3]). The cost 
is again 0{p 2 ). Note that the diagonal entry w pp = l/(9 pp — pl (0 11 ) _1 lp ). 

The above recursion shows how to track (0,0 _1 ) (as well as (0,0 -1 )) as one cycles 
across the rows/columns of the matrix 0. 

C.2 Algorithmic Description of Primal Growth for Graphical Lasso 
PGR-GL 

This elaborates Section 14.11 in the text. Given an initial working dimension po (typically 
Po = 1) and estimates of the precision and the covariance matrix (0 PoXpo , (0p o xp o ) -1 )' 
Algorithmic describes the task of obtaining the solution to (J2]), with 0, S having dimensions 
p x p. 

C.3 Performances of PINE-GL and its variants for low— high ac- 
curacy solutions 

We now proceed to show how gracefully they deliver solutions of lower accuracy within a 
much shorter span of time making it feasible to scale to very high dimensional problems. 
As mentioned earlier, the primal formulation is particularly suited for this task of delivering 
solutions with lower convergence tolerance, since it delivers a sparse and positive definite 
precision matrix and its exact inverse. Table [3] shows average timings in seconds across a 
grid of ten A values with varying degrees of accuracy. In case the application demands a 
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relatively low accuracy solution, then the algorithms deliver solutions within fractions of 
the time taken to deliver a solution with higher accuracy. 



p/N 


average % 
of nnz 


Accuracy 
(TOL) 


Algorithm Times 
PGR-GL PEX-GL 


(sec) 

T ~\ T TV T"l — 1 / i T 

PINE-GL 








i n— 2 
10 


60.16 


68.29 


49.0 


1 x 10 3 / 2 x 


10 3 


61.3 


10 3 


80.59 


104.67 


67.31 








10" 4 


112.31 


134.13 


91.91 








10 -5 


140.08 


174.52 


120.75 








i n— 2 
10 


68.46 


92.32 


64.56 


1 x 10 3 / 1 x 


10 3 


62.77 


10 3 


95.49 


126.25 


95.47 








10" 4 


128.94 


167.35 


127.11 








10 -5 


174.44 


199.36 


177.06 








1 n— 2 
10 


82.94 


117.97 


65.94 


1 x 10 3 / .5 x 10 3 


66.67 


10 3 


127.17 


153.29 


96.69 








10 4 


181.99 


195.19 


140.37 








10~ 5 


252.95 


234.97 


200.43 








1 n— 2 
10 


149.89 


195.19 




1.5 x 10 3 / 4 


x 10 3 


62.44 


10 3 


189.52 


317.64 


204.77 








10 4 


266.63 


396.82 


275.75 








10~ 5 


344.26 


460.29 


333.55 








i n-2 
10 


145.59 


215.65 


141. o7 


1.5 x 10 3 / 3 


x 10 3 


62.78 


10~ 3 


203.86 


300.57 


201.62 








10" 4 


261.12 


397.72 


271.34 








10~ 5 


344.19 


477.64 


346.47 








io- 2 


149.13 


251.51 


169.37 


1.5 x 10 3 / 2 


x 10 3 


63.11 


10~ 3 


250.02 


354.75 


238.57 








10" 4 


319.36 


445.38 


317.51 








10~ 5 


414.91 


523.38 


408.77 



Table 3: Table showing average timings in seconds across a grid of ten A values with varying 
degrees of accuracy i.e. TOL. The column under average % of non-zeroes denotes the % of 
non-zeros in the optimal solution, averaged across the ten A values. No warm-start across 
A's is used. 

Table [3] shows that PINE-GL , PEX-GL and PGR-GL return lower accuracy solutions 
to (j2J) — in times which are much less than that taken to obtain higher accuracy solutions. 
Note that even the lower accuracy solutions correspond to sparse and positive definite 
precision matrices, with guarantees on 'TOL'. Even more interesting is the flexibility of 
methods like PINE-GL , PGR-GL to obtain sparse and positive definite solutions at low 
computational cost when compared to the times taken by the dual algorithms in Table 
[TJ These favorable comparative timings go on to support our claim of making large scale 
covariance selection very practical. PEX-GL turns out to the slowest among the three, 
PGR-GL and PINE-GL are quite strong competitors, with PINE-GL winning in majority 
of the situations. 
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C.4 Graphical display of sparsity patterns in the Movie-lens Graphs 



This is an elaborate version of Section IHT21 in the main text. Figure [2] represents the sparsity 
structures of the precision graphs obtained from the movie-movie similariti es. The graph 



struc tures are displayed under the Sparse reverse Cuthill-McKee ordering tlGilbert et al. 



1992)0 of the precision matrices as delivered by our algorithm PINE-GL for three different 



values of A. The presence of a 'dot' in the figure represents a non-zero edge weight in the 
corresponding movie-movie precision graph. The percentages of (off-diagonal) non-zeros 
are presented below each figure. The tapering nature of the graphs for larger values of A, 
show that the movies towards the extreme ends of the axes tend to be connected to few 
other movies. These movies tend to be conditionally dependent on very few other movies. 
The higher density of the points towards the center (of the left two figures) show that those 
movies tend to be connected to a larger number of other movies. 




Figure 2: MATLAB spy plots under the reverse Sparse reverse Cuthill-McKee ordering 
of the vertices of the sparse precision matrices obtained via PINE-GL , for three different 
values of the tuning parameters. A dot represents presence of an edge. The percentage of 
off-diagonal zeros in the matrix are given below each plot. 



C.5 Movie-ID to Movie mapping Table 

The movie ids — movie name mapping is given in the following table: 

7 For a sparse symmetric matrix A the reverse Cuthill-McKee ordering is a permutation n such that 
A(tt, 7t) tends to have its nonzero elements closer to the diagonal. This is often used for visualizing the 
sparsity structure of large dimensional graphs 
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(0) PuppetMaster5: TheFinalChapter (1994) 

(2) Carnosaur3: PrimalSpecies (1996) 

(4) Fridaythel3thPartV: ANewBeginning (1985) 

(6) Porky ' sRevenge (1985) 

(8) SororityHouseMassacre (1986) 

(10) PoliceAcademy5 : Assignment :MiamiBeach( 1988) 

(12)RockyIV(1985) 

(14) Hellbound:HellraiserII(1988) 

(16) CloseShave,A(1995) 

(18) Godfather :PartII, The (1974) 



(I) PuppetMaster4 (1993) 
(3) Carnosaur2(1995) 

(5) Fridaythel3thPartVII:TheNewBlood(1988) 
(7) Porky 's II: TheNextDay (1983) 
(9) SororityHouseMassacrell (1990) 

(II) PoliceAcademy6 : CityUnderSiege ( 1989) 
(13) Rockylll (1982) 

(15) Hellraiser(1987) 

(17) WrongTrousers,The (1993) 

(19) Godfather, The (1972) 



Table 4: Table showing thenames of the top 20 Movies, appearing in the top twenty 
strongest partial correlations. It is seen from Figure [1] that edges often occur between 
movies and their sequels. 
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